QTM 447 Lecture 6: Adaptive Methods for Minimization

Kevin McAlister

January 30, 2025

Gradient Descent

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

When there is no way to analytically solve for the empirical risk minimizer, we need computational optimization methods to find the values of any unknown coefficients.

The most common approach when we can evaluate the gradient of the loss function at any candidate set of values is gradient descent:

\[\bb_{t + 1} = \bb_{t} - \eta_t \mathbf g(\bb_t)\]

  • \(\mathbf g(\bb_t)\) is the gradient of the loss function evaluated at the current value

  • \(\eta_t\) is the step size at iteration \(t\)
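A minimal sketch of this update on a squared-error loss, where the gradient has a closed form (the data, step size, and iteration count here are illustrative choices, not values from the lecture):

```python
import numpy as np

# Minimize L(b) = ||X b - y||^2 / N; its gradient is 2 X^T (X b - y) / N.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true

def grad(b):
    return 2 * X.T @ (X @ b - y) / len(y)

b = np.zeros(3)          # initial value
eta = 0.1                # fixed step size
for t in range(500):
    b = b - eta * grad(b)
```

With a well-conditioned quadratic loss like this one, the fixed step size is enough for `b` to converge to `beta_true`.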

Stochastic Gradient Descent

For most problems, the loss is expressed in the form:

\[\expn \mathcal L(y_i , \hat{y_i})\]

Meaning that:

\[\mathbf g(\bb) = \expn \mathbf g_i(\bb)\]

  • \(N\) computations of \(P\) values

As \(N\) and/or \(P\) gets large, this does not scale particularly well!

Stochastic Gradient Descent

Instead, we can take advantage of the LLN and approximate the gradient at each step using a sample:

Set initial value b, batch size B > 0, learning rate eps
t = 0
gt = gradient at b
while gt is not approximately 0:
  Sample B obs from the N available
  gt = gradient of the batch loss at b
  b = b - eps*gt
  t += 1

print(b) 
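The pseudocode above can be sketched as runnable Python. The data, batch size, and learning rate are illustrative choices; a fixed iteration cap stands in for the stopping rule, since a stochastic gradient never hits exactly zero:

```python
import numpy as np

# Mini-batch SGD on the squared-error loss ||X b - y||^2 / N.
rng = np.random.default_rng(1)
N, P, B = 1000, 3, 32
X = rng.normal(size=(N, P))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

b = np.zeros(P)
eta = 0.05
for t in range(2000):
    idx = rng.choice(N, size=B, replace=False)   # sample B obs from N
    Xb, yb = X[idx], y[idx]
    gt = 2 * Xb.T @ (Xb @ b - yb) / B            # batch gradient at b
    b -= eta * gt
```

Each iteration touches only B = 32 rows instead of all N = 1000, which is the whole point: cheaper steps at the cost of some noise around the minimum.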

Stochastic Gradient Descent

Even when the batch size is 1, with a suitable learning rate schedule, we’ll land at the minimum

  • Plus some noise!

The benefit is a much less computationally intense optimization routine!

The cost is the noise.

Stochastic Gradient Descent

To demonstrate:

1000 observations of a basic binary classifier

50 features

Only 2 really matter! The rest are just noise.

Stochastic Gradient Descent

Both methods get where they need to go!

  • SGD gets there a little quicker but dances around.

In both cases, though, the move from start to minimum is quite inefficient

  • In the gradient descent case, it uses the full 10K iterations without ever reaching the stopping criterion

  • SGD is similar.

  • This means that the optimization path makes inefficient moves by searching solely for the direction of steepest descent

  • Very sensitive to step size!

Stochastic Gradient Descent

Let’s start by examining regular ol’ gradient descent

  • Apply what we learn to SGD

Stochastic Gradient Descent

A Method that Reduces Some Sensitivity and Improves Efficiency: Learning Rate Schedules

Rather than using a fixed step size, start with a large step size and temper the learning rate as the algorithm proceeds.

Most common scheduler: Polynomial Decay

\(\eta_t = \eta_0(b t + 1)^{-a}\)

  • Common parameter values: \(a = .5\) and \(b = 1\)
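A one-line sketch of this scheduler, using the common parameter values above (the initial rate \(\eta_0 = 0.5\) is an illustrative choice):

```python
# Polynomial decay: eta_t = eta0 * (b*t + 1)^(-a),
# with the common values a = 0.5 and b = 1.
def poly_decay(t, eta0=0.5, a=0.5, b=1.0):
    return eta0 * (b * t + 1) ** (-a)

eta_start = poly_decay(0)    # full initial rate: 0.5
eta_later = poly_decay(99)   # 100^(-0.5) = 1/10 of the initial rate
```

After 100 iterations the step size has fallen to a tenth of its starting value, but it never reaches zero.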

Stochastic Gradient Descent

This schedule starts with large steps then quickly decays to smaller steps.

This makes sense:

  • Our initial guess is probably pretty bad. Move away from it in a hopefully decreasing direction quickly.

  • Once we move onto a path that is decent, start moving a little slower.

  • At the end, we’re hopefully around the minimum and we’ll tiptoe towards the actual minimum

Stochastic Gradient Descent

Static learning rate schedules aren’t a cure-all!

Stochastic Gradient Descent

Gradient descent moves quickly down the hill

But, then it hits a flat minimum

  • A trough in the loss surface

Moves very slowly.

Stochastic Gradient Descent

Even with a static learning rate schedule:

  • The moves through the space are slow

  • We still can’t start with too large a step

  • The path is jumpy if the choice of decay function isn’t just right

Second Order Methods

A first improvement to gradient descent leverages an alternative proof that it works - multivariate Taylor Series approximations.

Let \(\mathcal L(\bb_t)\) be the loss evaluated at coefficients at time \(t\). We can use a Taylor series expansion to approximate the surface of the loss function in the neighborhood of the evaluation.

A first order approximation:

\[\mathcal L(\bb) = \mathcal L(\bb_t) + \mathbf g(\bb_t)^T (\bb - \bb_t) + \mathcal O\left(\|\bb - \bb_t\|^2\right)\]

A second order approximation:

\[\mathcal L(\bb) = \mathcal L(\bb_t) + \mathbf g(\bb_t)^T (\bb - \bb_t) + \frac{1}{2} (\bb - \bb_t)^T \mathbf H(\bb_t) (\bb - \bb_t) + \mathcal O\left(\|\bb - \bb_t\|^3\right)\]

Second Order Methods

If our goal is to find the smallest value of the loss, then it suffices to use the Taylor series approximation to minimize the loss

  • Find a new value within \(\eta\) of \(\bb_t\) that yields the minimum loss

For the first order approximation:

\[\bb_{t + 1} = \bb_{t} - \eta \mathbf g(\bb_t)\]

For the second order approximation:

\[\bb_{t + 1} = \bb_{t} - \eta \mathbf H(\bb_t)^{-1} \mathbf g(\bb_t)\]

Second Order Methods

Why might the second order approximation work better?

  • The error of the approximation is lower leading to a better guess as to the next move

Pre-conditioning on the inverse Hessian deskews the loss space

  • Takes into account the curvature of the loss function space instead of straight-line shooting from each spot

Second Order Methods

Recall that the gradient for logistic regression is:

\[\mathbf g(\bb_t) = \expn \mathbf x_i(\sigma(\mathbf x_i^T \bb_t) - y_i)\]

And the Hessian:

\[\mathbf H(\bb_t) = \expn \mathbf x_i \mathbf x_i^T\sigma(\mathbf x_i^T \bb_t)(1 - \sigma(\mathbf x_i^T \bb_t))\]
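Putting the two formulas together gives a sketch of the second-order update for logistic regression (the simulated data are an illustrative choice; a full Newton step, \(\eta = 1\), is used):

```python
import numpy as np

# Newton's method for logistic regression, using the average gradient
# and Hessian formulas above.
rng = np.random.default_rng(2)
N, P = 500, 2
X = rng.normal(size=(N, P))
beta_true = np.array([1.5, -1.0])

def sigma(z):
    return 1 / (1 + np.exp(-z))

y = (rng.uniform(size=N) < sigma(X @ beta_true)).astype(float)

b = np.zeros(P)
for t in range(25):
    p = sigma(X @ b)
    g = X.T @ (p - y) / N                       # average gradient
    H = (X * (p * (1 - p))[:, None]).T @ X / N  # average Hessian
    b = b - np.linalg.solve(H, g)               # Newton step (eta = 1)
```

Note that `np.linalg.solve(H, g)` computes \(\mathbf H^{-1}\mathbf g\) without forming the explicit inverse; it converges to a gradient of essentially zero in a handful of iterations.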

Second Order Methods

The first order gradient descent method only makes moves that are orthogonal to the level sets

  • From the start, reach the next level set as quickly as possible

The second order method is able to ignore the level sets and make moves that go in other directions!

  • Make big moves at the start to try to find a direction that points at the minimum - not just orthogonal to the level sets

  • Once we find that direction, immediately dive to the minimum

Second Order Methods

This is called Newton’s Method, and it is the top performer when the loss space is strictly convex

  • It takes far fewer iterations to reach the minimum than gradient descent!

But, takes way more time per iteration.

Any thoughts as to why?

Second Order Methods

The big problem - matrix inversion!

The most efficient matrix inversion methods scale as \(\mathcal O(P^3)\)-ish.

  • The time needed grows cubically in the number of features!

  • Won’t scale well to high-dimensional data

Better First Order Methods

We’re quickly going to grow out of the realm where it is viable to compute the Hessian

  • It’ll also be really expensive to approximate the Hessian for large chained equations!

Can we improve the first order methods?

Better First Order Methods

We previously saw one big speed problem

  • The first order method works well at the beginning when the loss function is decreasing rapidly

  • Slows to a halt in flat regions

  • The gradient is small, so the change from \(t\) to \(t + 1\) is also small.

Maybe we can speed things up in the flat regions?

Momentum

A heuristic method that has good properties:

Let

\[\mathbf m_{t + 1} = b \mathbf m_t + \mathbf g(\bb_t)\]

\[\bb_{t + 1} = \bb_t - \eta_t \mathbf m_{t + 1}\]

where \(0 \le b \le 1\) is a constant.

This is called momentum
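A sketch of the heavy-ball update on a simple squared-error loss (the data, \(\eta\), and \(b = 0.9\) are illustrative choices):

```python
import numpy as np

# Gradient descent with momentum on the loss ||X w - y||^2 / N.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
w_true = np.array([1.0, -1.0, 2.0, 0.5])
y = X @ w_true

w = np.zeros(4)
m = np.zeros(4)          # momentum buffer, m_0 = 0
eta, b = 0.05, 0.9
for t in range(300):
    g = 2 * X.T @ (X @ w - y) / len(y)
    m = b * m + g        # m_{t+1} = b m_t + g(w_t)
    w -= eta * m         # step along the momentum buffer
```

The only change from plain gradient descent is stepping along `m` instead of `g`, at the cost of storing one extra \(P\)-vector.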

Momentum

In each iteration, we are moving along the same direction

  • In other words, the gradient at \(\bb_t\) and \(\bb_{t - 1}\) are essentially the same.

Even though we know that we’re probably going to keep moving in the same direction at the next step, we are only moving \(\eta\) forward each time!

  • Slooooooooowwwwwwwwwwlllllyyyyyy

Momentum

\[\mathbf m_{t + 1} = b \mathbf m_t + \mathbf g(\bb_t)\]

Unrolling the recursion (taking \(\mathbf m_0 = \mathbf 0\)):

\[\mathbf m_{t + 1} = b^2 \mathbf m_{t - 1} + b \mathbf g(\bb_{t - 1}) + \mathbf g(\bb_t)\]

\[\mathbf m_{t + 1} = b^3 \mathbf m_{t - 2} + b^2 \mathbf g(\bb_{t - 2}) + b \mathbf g(\bb_{t - 1}) + \mathbf g(\bb_t)\]

\[\mathbf m_{t + 1} = \sum \limits_{h = 0}^t b^h \mathbf g(\bb_{t - h})\]

  • A decaying weighted average of the previous gradients!

Momentum

How does this speed things up in the flat minimum case?

The gradient in this space remains constant

\[\mathbf m_{t + 1} = \mathbf g(\bb_{t}) \sum \limits_{h = 0}^t b^h\]

Assuming that this sequence continues indefinitely:

\[\sum \limits_{h = 0}^\infty b^h = \frac{1}{1 - b}\]

Letting \(b = .9\), as we stay on the same gradient for a while, this gets closer and closer to 10

  • Meaning that we’re moving \(10\eta\) in a step instead of just \(\eta\)!
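A quick numerical check of this claim, with \(b = 0.9\) and a constant unit gradient:

```python
# Under a constant gradient g = 1, the momentum buffer approaches
# 1 / (1 - b); for b = 0.9 the limit is 10, so the effective move
# grows toward 10 * eta per step.
b = 0.9
m = 0.0
g = 1.0
for t in range(200):
    m = b * m + g    # the momentum recursion under a constant gradient
```

After 200 iterations `m` is within about 1e-8 of the limit 10.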

Momentum

This increase in speed isn’t uniform, though!

\[\mathbf g(\bb_t)^T = [-1,3] \text{ ; } \mathbf g(\bb_{t - 1})^T = [1,3]\]

\[\mathbf m_{t + 1} = \mathbf g(\bb_t) + b \mathbf g(\bb_{t - 1}) + \sum \limits_{h = 2}^{t} b^h \mathbf g(\bb_{t - h})\]

with \(b\) sufficiently close to one:

\[\mathbf g(\bb_t) + b \mathbf g(\bb_{t - 1}) \approx [0,6]^T\]

  • The sharp change in the gradient was shrunk back towards zero

  • The direction that had previously been seen continues to increase in step size

Momentum

Think of it like a heavy ball or a rolling stone

  • If the hill is always in the same direction, keep rolling faster!

  • If the hill has lots of sharp changes, hit the wall and stop!

Momentum

It’s really fast!

  • Doesn’t waste time when we’re moving in a good direction.

Cons:

  • If the loss function is really irregular, then it’ll just move incredibly slowly for all eternity

  • Requires storing two \(P\)-vectors: \(\mathbf m\) and \(\mathbf g(\bb_t)\). Modern machines with plenty of RAM won’t have a problem with this, though!

Stochastic Gradient Descent

Will second order methods and/or momentum work with smaller batch sizes?

  • Thoughts?

  • Do you think that the average Hessian converges as quickly in \(N\) compared to the gradient?

  • With batch size 1, do you expect that the gradient will ever really build much momentum?

Stochastic Gradient Descent

Problems unique to SGD:

  • Variance around the true minimum

  • High sensitivity to learning rate

  • The Hessian isn’t going to converge that quickly due to the small batch sizes and high dimensionality

Stochastic Gradient Descent

One solution is to create a learning rate schedule that decays over time.

  • This helps reduce sensitivity to the learning rate

Problem:

It’s a preset schedule that doesn’t adapt to the specifics of the loss function

  • What’s good for one problem may not be good for another problem

Stochastic Gradient Descent

Another thought is to use the Hessian

  • Too computationally expensive and too noisy to really approximate \(P^2\) terms

Maybe the idea of multiplying by the inverse Hessian is useful, though?

Hessians, Curvature, and Variance

\[-\mathbf H(\hat{\boldsymbol \Theta})^{-1}\]

  • In statistical theory, the negative Hessian of the log-likelihood is the (observed) Fisher information

  • Its inverse is associated with the variance of the parameters at a particular point in the parameter space

  • So we can think of the inverse Hessian as a variance-covariance matrix

Hessians, Curvature, and Variance

From a calculus perspective:

  • The diagonal of the inverse Hessian tells us how sharp the loss surface is with respect to a specific variable

  • A small diagonal element (small variance) means that a small change to \(\hat{\beta}_j\) will result in a large change in the loss

  • A large diagonal element (large variance) means that a small change to \(\hat{\beta}_j\) will result in a small change in the loss

  • The off diagonals tell us the covariance between the two parameters

Hessians, Curvature, and Variance

A thought:

If we want to choose a move of size \(\eta\) that carries us toward the minimum most efficiently, we should scale the step on each parameter by its diagonal in the inverse Hessian: along sharp directions (small diagonals) the minimum is nearby, while along flat directions (large diagonals) it is far away!

This is how second order methods race through the loss space

  • Make big changes on parameters that have big diagonal inverse Hessian elements (flat directions)

  • Make small changes on parameters that have small diagonal inverse Hessian elements (sharp directions)

Hessians, Curvature, and Variance

If we precondition the gradient on the Hessian, then we move through the space in an optimal way!

But, the Hessian is prohibitively expensive to calculate and hard to approximate using SGD

The BHHH approximation to the Hessian

\[\mathbf H(\bb_t) \approx \mathbf g(\bb_t) \mathbf g(\bb_t)^T\]

  • Often called the Outer Product of the Gradient Approximation

  • Arises as a consequence of MLEs and Fisher information

Hessians, Curvature, and Variance

\[\mathbf H(\bb_t) \approx \mathbf g(\bb_t) \mathbf g(\bb_t)^T\]

This approximation only reaches equality at a point in the loss space where the gradient is equal to zero.

But, it is sort of close elsewhere.

The key thing here is that we already have the gradient evaluated at a point for gradient descent procedures!

  • This approximation is just a function of the gradient

Hessians, Curvature, and Variance

Even more clever:

Let’s just ignore the off-diagonals of the Hessian and concentrate on the diagonal

  • Evaluate how much curvature there is for each parameter ignoring cross-products

With the BHHH approximation:

\[\tilde{\mathbf H}(\bb_t) = \begin{bmatrix} g_1(\bb_t)^2 & 0 & 0 & \ldots \\ 0 & g_2(\bb_t)^2 & 0 & \ldots \\ 0 & 0 & g_3(\bb_t)^2 & \ldots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} \]

where \(g_j(\bb_t)\) is the \(j\)-th element of the gradient.

Hessians, Curvature, and Variance

Approximating the second order search:

\[ \bb_{t + 1} = \bb_t - \eta \tilde{\mathbf H}(\bb_t)^{-1} \mathbf g(\bb_t)\]

With a few more adjustments to handle numerical instability and to take advantage of SGD, we get the AdaGrad algorithm:

\[ \bb_{t + 1} = \bb_t - \eta \frac{\mathbf g(\bb_t)}{\sqrt{\mathbf s_{t} + \epsilon}}\]

where

\[\mathbf s_t = \sum \limits_{h = 1}^t \text{diag}\left(\tilde{\mathbf H}(\bb_h)\right) = \sum \limits_{h = 1}^t \mathbf g(\bb_h)^2\]

and \(\epsilon\) is a small positive value to prevent division by zero.
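A sketch of the AdaGrad update on a squared-error loss (the data, \(\eta\), and \(\epsilon\) are illustrative choices):

```python
import numpy as np

# AdaGrad on the loss ||X w - y||^2 / N.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
s = np.zeros(3)                      # running sum of squared gradients
eta, eps = 0.5, 1e-8
for t in range(2000):
    g = 2 * X.T @ (X @ w - y) / len(y)
    s += g ** 2                      # accumulate the BHHH diagonal
    w -= eta * g / np.sqrt(s + eps)  # per-parameter adaptive step
```

Each coordinate gets its own effective learning rate \(\eta / \sqrt{s_j}\), which shrinks as that coordinate's squared gradients accumulate.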

Hessians, Curvature, and Variance

AdaGrad has an adaptive learning rate for the gradient

  • We set the initial learning rate

  • Parameters with large accumulated squared gradients (high approximate curvature) get smaller effective learning rates; parameters with small ones get larger effective rates

  • The learning rate diminishes over time as the running sum of the squared gradient values grows

Think of this as a smarter learning rate decay schedule that doesn’t require choosing a decay rate

  • It just does it itself.

Sort of like momentum for the “Hessian”

Hessians, Curvature, and Variance

Note: AdaGrad can use a stopping criterion

  • In this case it’s fine, and we will use one to prove a point

Not always a good idea!

  • It depends on how the batches are drawn for SGD!

Stochastic Gradient Descent

Adaptive Gradients

It’s quite quick!

  • Still a little inefficient.

  • Spends most of its time tiptoeing to the minimum

This is partially because the running sum of the squared gradient values keeps growing, driving the step size to zero too quickly!

Adaptive Gradients

An improvement: RMSProp

\[ \bb_{t + 1} = \bb_t - \eta \frac{\mathbf g(\bb_t)}{\sqrt{\mathbf s_{t} + \epsilon}}\]

where

\[\mathbf s_t = a \mathbf s_{t - 1} + (1 - a) \mathbf g(\bb_t)^2\]

where \(0 \le a \le 1\).

Instead of a raw sum, implement a decaying moving average of the approximate diagonal of the Hessian

  • Allows the method to not move to step size zero so quickly!
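A sketch of the RMSProp update (the data, \(\eta\), \(a\), and \(\epsilon\) are illustrative choices; note that with a fixed \(\eta\) the iterates hover near the minimum at a scale set by \(\eta\) rather than converging exactly):

```python
import numpy as np

# RMSProp on the loss ||X w - y||^2 / N.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
s = np.zeros(3)
eta, a, eps = 0.01, 0.9, 1e-8
for t in range(3000):
    g = 2 * X.T @ (X @ w - y) / len(y)
    s = a * s + (1 - a) * g ** 2     # decaying average, not a raw sum
    w -= eta * g / np.sqrt(s + eps)
```

Because `s` can shrink as well as grow, the effective step size no longer decays monotonically to zero the way AdaGrad's does.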

Adaptive Gradients

Why not combine this approach of adaptive gradients with momentum?

The Adaptive Moment method (ADAM):

\[ \bb_{t + 1} = \bb_t - \eta \frac{\mathbf m_t}{\sqrt{\mathbf s_{t} + \epsilon}}\]

\[\mathbf m_t = b_1 \mathbf m_{t - 1} + (1 - b_1) \mathbf g(\bb_t)\]

\[\mathbf s_t = b_2 \mathbf s_{t - 1} + (1 - b_2) \mathbf g(\bb_t)^2\]

  • Let the gradient move like a heavy ball

  • Also let the “Hessian” move like a heavy ball

  • Get the benefits of momentum on both elements of the optimization problem!

Adaptive Gradients

ADAM implementation:

  • \(\eta\) is commonly set to .001 or .01. May need to be tuned if you’re having convergence problems.

  • \(b_1 = .9\)

  • \(b_2 = .999\)

No real reason other than these values seem to perform well in most settings.
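Putting the pieces together, a sketch of ADAM exactly as written above, i.e., without the bias-correction terms that standard implementations add, and with the common defaults \(b_1 = .9\), \(b_2 = .999\) (the data and \(\eta = .01\) are illustrative choices):

```python
import numpy as np

# ADAM on the loss ||X w - y||^2 / N: momentum on the gradient (m)
# and a decaying average of the squared gradient (s).
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
m = np.zeros(3)
s = np.zeros(3)
eta, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(5000):
    g = 2 * X.T @ (X @ w - y) / len(y)
    m = b1 * m + (1 - b1) * g            # heavy-ball gradient
    s = b2 * s + (1 - b2) * g ** 2       # heavy-ball "Hessian" diagonal
    w -= eta * m / np.sqrt(s + eps)
```

Relative to RMSProp, the only change is stepping along the smoothed gradient `m` rather than the raw gradient `g`.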

Stochastic Gradient Descent

AdaGrad, RMSProp, and ADAM:

  • Do not require much tuning of the initial learning rate

  • Converge to the minimum with relatively low noise

  • Converge to the minimum faster than standard SGD

Different problems will benefit from different approaches!

  • ADAM is sort of the default optimizer

  • The other two are a little less computationally intense (they only need to carry \(\mathbf s\), not \(\mathbf m\) as well)

  • Work well for simple problems. ADAM works much better in complicated high dimensional problems.

Stochastic Gradient Descent

Why do we care so much about these different optimization methods?

Stochastic Gradient Descent

These methods are the building blocks of different deep learning methods!

  • As long as we can compute the gradient evaluated at the unknowns, we can minimize our loss function

  • As we’ll soon see, neural networks have a lot of parameters. Many of them will be ill-conditioned and not yield clean minima.

  • The quality of a deep learning classifier lives and dies by our ability to effectively minimize the loss function

Stochastic Gradient Descent

SGD is not optimal for all problems!

  • The noise/time tradeoff differs for different methods

Regular ol’ linear methods should still use more standard optimizers

  • There are tricks to evaluating the gradient even when we can’t analytically solve for its zero that don’t require slow moves through the loss surface

Next Steps

We’ve finished the set of foundational topics that I believe we need to cover to better understand way more complicated linear model structures

  • Generalization Theory

  • Likelihood and chained equations

  • Regularization

  • Minimization with no analytical solution

Topics that are not covered as in depth in the intro class!

Next Steps

Next class, we’ll start discussing universal approximators and feed forward neural networks

  • The feature extractor set up

  • Activation functions and similarities to splines

  • Multi-unit single layer networks as universal approximators

Next Steps

The goal of this class is to really dig into how and why neural networks and deep learning work

  • More statistical theory than an applications or computer science class